Geometric Suffix Tree: A New Index Structure for Protein 3-D Structures

Author

  • Tetsuo Shibuya
Abstract

Protein structure analysis is one of the most important research issues in the post-genomic era, and faster and more accurate query data structures for such 3-D structures are highly desired for research on proteins. This paper proposes a new data structure for indexing protein 3-D structures. For strings, there are many efficient indexing structures such as suffix trees, but it has been considered very difficult to design comparably sophisticated data structures for 3-D structures like proteins. Our index structure is based on suffix trees and is called the geometric suffix tree. By using the geometric suffix tree for a set of protein structures, we can search for all of their substructures whose RMSDs (root mean square deviations) or URMSDs (unit-vector root mean square deviations) to a given query 3-D structure are not larger than a given bound. Though there are O(N^2) substructures, our data structure requires only O(N) space, where N is the sum of the lengths of the proteins in the set. We propose an O(N^2) construction algorithm for it, while a naive algorithm would require O(N^3) time to construct it. Moreover, we propose an efficient search algorithm. We also present computational experiments to demonstrate the practicality of our data structure. The experiments show that, when applied to a protein structure database, the construction time of the geometric suffix tree is in practice almost linear in the size of the database.
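To make the search problem in the abstract concrete, the following is a minimal sketch of the brute-force baseline that an index such as the geometric suffix tree is meant to improve on. It is not the paper's algorithm. It assumes each protein is given as an array of 3-D coordinates (for example, one point per residue), and it computes the RMSD under optimal rigid superposition with the standard Kabsch method via NumPy. The function names `rmsd` and `naive_search` are hypothetical and chosen only for illustration.

```python
import numpy as np


def rmsd(P, Q):
    """Minimum RMSD between two equal-length 3-D point lists over all
    rigid motions (rotation + translation), using the Kabsch method."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Remove the translation by centering both point sets.
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    # Singular values of the 3x3 covariance matrix give the best overlap.
    U, S, Vt = np.linalg.svd(P.T @ Q)
    # Exclude reflections: flip the smallest singular value if needed.
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[-1] = -S[-1]
    e = (P ** 2).sum() + (Q ** 2).sum() - 2.0 * S.sum()
    # Clamp tiny negative values caused by floating-point rounding.
    return float(np.sqrt(max(e, 0.0) / len(P)))


def naive_search(chains, query, bound):
    """Scan every contiguous window of len(query) points in every chain and
    report (chain index, start offset, RMSD) for windows within the bound."""
    m = len(query)
    hits = []
    for c, chain in enumerate(chains):
        chain = np.asarray(chain, dtype=float)
        for start in range(len(chain) - m + 1):
            d = rmsd(chain[start:start + m], query)
            if d <= bound:
                hits.append((c, start, d))
    return hits
```

This scan recomputes a superposition for every window of every chain on each query, which is the per-query cost that a precomputed index aims to avoid.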


Similar resources

Prefix-Shuffled Geometric Suffix Tree

Protein structure analysis is one of the most important research issues in the post-genomic era, and faster and more accurate index data structures for such 3-D structures are highly desired for research on proteins. The geometric suffix tree is a very sophisticated index structure that enables fast and accurate search on protein 3-D structures. By using it, we can search from 3-D structure dat...


PSIST: A Scalable Approach to Indexing

Approaches for indexing proteins, and for fast and scalable searching for structures similar to a query structure, have important applications such as protein structure and function prediction, protein classification and drug discovery. In this paper, we develop a new method for extracting local structural (or geometric) features from protein structures. These feature vectors are in t...


A Comparison of Suffix Tree based Indexing and Search Techniques for Querying Protein Structures

Biological research comes across different protein structures inside a cell which may be required to map to known proteins to quickly determine their functionality. Efficient techniques for searching a protein structure in a database containing all the known proteins are needed to classify the protein and predict its function. Comparing the structure of unknown protein individually with every p...


Compact Suffix Trees Resemble PATRICIA Tries: Limiting Distribution of the Depth

Suffix trees are the most frequently used data structures in algorithms on words. In this paper, we consider the depth of a compact suffix tree, also known as the PAT tree, under some simple probabilistic assumptions. For a biased memoryless source, we prove that the limiting distribution for the depth in a PAT tree is the same as the limiting distribution for the depth in a PATRICIA trie, even...


On the Construction of Classes of Suffix Trees for Square Matrices: Algorithms and Applications

We provide a uniform framework for the study of index data structures for a two-dimensional matrix TEXT[1 : n, 1 : n] whose entries are drawn from an ordered alphabet Σ. An index for TEXT can be informally seen as the two-dimensional analog of the suffix tree for a string. It allows on-line searches and statistics to be performed on TEXT by representing compactly the Θ(n^3) square submatrices of ...



Publication date: 2006